Fix binary search #3

Open

sleepsort wants to merge 1 commit into master
Conversation

sleepsort (Author):
Basically, we cannot reuse the exit state (quant_A, mid) as the final result: as the following logs show, the current logic will produce a quantized vector with mean_err > target_error.

Since there are only log(N) candidate nodes in the search tree, it would be simpler to just calculate the quantization in a final pass.
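The proposed shape (half-open search, recompute once at the end) can be sketched as a self-contained toy. Here `error(q)` is a hypothetical stand-in for the quantize-and-measure step in the real code, assumed non-increasing in q; this is not the repository's actual function.

```python
def search_quant(error, target_err, low=1, high=128):
    """Return the smallest q in [low, high] with error(q) <= target_err,
    assuming error is non-increasing in q."""
    while low < high:
        mid = low + (high - low) // 2
        if error(mid) > target_err:
            low = mid + 1      # mid is too coarse; the answer lies above it
        else:
            high = mid         # mid is good enough; keep it in the range
    # Final pass: recompute the result from the settled index instead of
    # reusing whatever the loop body last produced.
    return low, error(low)

# Toy monotone error curve: error(q) = 100 / q
q, err = search_quant(lambda q: 100.0 / q, target_err=3.0)
```

With this toy curve the smallest q meeting the target is 34, since 100/34 is just under 3.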

Log before fix:
https://gist.github.com/sleepsort/1d0d1582825c6ea248eca19a0a471d1f

Log after fix:
https://gist.github.com/sleepsort/64ac49af24dc98eefa3ea8c4575ab3b6

To reproduce:

@ot (Owner) left a comment

This was sort of intentional: we just care about being close enough to the target. Because of the invariant described in the inline comments, the algorithm is correct; the "fix" would be to use the "high" quantization rather than "mid". Of course, you would then need to recompute the results with that value.

It is not clear that your fix is correct.
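The minimal change ot describes (keep the existing loop, but take `high` as the answer and recompute with it) can be sketched like this, again with a hypothetical non-increasing `error(q)` standing in for the quantize-and-measure step:

```python
def search_quant_keep_high(error, target_err, low=1, high=128):
    # Precondition (assumed by the caller): error(high) <= target_err < error(low)
    while high - low > 1:
        mid = (high + low) // 2
        if error(mid) > target_err:
            low = mid
        else:
            high = mid
    # Postcondition: high - low == 1 and error(high) <= target_err < error(low),
    # so return high and recompute the quantization with that value.
    return high, error(high)

q, err = search_quant_keep_high(lambda q: 100.0 / q, target_err=3.0)
```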

@@ -64,19 +64,23 @@ def quantize_array_lbda(A, lbda):
 def quantize_array_target(A, target_err):
     low = 1
     high = 128
+    mid = low
ot (Owner):

You don't need this

-    while high - low > 1:
-        mid = (high + low) / 2
+    while low < high:
+        mid = low + (high - low) / 2
ot (Owner):

Python has arbitrary precision integers, no need to do this

         quant_A, dequant_A = quantize_array(A, mid)
         mean_err = np.mean(np.sqrt(np.sum((dequant_A - A)**2, axis=-1)) / A_norms)
         logging.info("Binary search: q=%d, err=%.3f", mid, mean_err)

         if mean_err > target_err:
-            low = mid
+            low = mid + 1
ot (Owner):

This breaks the loop invariant that error(high) <= target_err < error(low)

         else:
             high = mid

+    mid = low
+    quant_A, dequant_A = quantize_array(A, mid)
ot (Owner):

This is copypasta

@sleepsort (Author) commented Oct 2, 2016

Yeah, I understand that the result is always near the error boundary. I just want to make sure the code expresses the intention clearly.

And no, the algorithm is not right: simply returning high doesn't solve the problem. A simple counter-example is an array of the same number, [0.1] * 128; the algorithm will return index 1 instead of index 0, which is surely wrong.

This error comes from using high - low > 1 as the loop condition. We cannot tighten that condition to narrow the range further, because with integer truncation mid = (low + high) / 2 will always equal low in some cases, causing an infinite loop.
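The truncation trap is easy to see in isolation: once the endpoints are adjacent, integer division pins mid to low, so a loop body that assigns low = mid can never make progress.

```python
low, high = 33, 34          # adjacent endpoints
mid = (low + high) // 2     # integer division truncates toward low
assert mid == low           # mid cannot move past low here
# A loop of the form `while low < high: ... low = mid ...` would therefore
# spin forever at this point; assigning `low = mid + 1` restores progress.
```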

I wrote an example test to show that my algorithm is right (I can try to prove it later) and that the current code has problems.

https://gist.github.com/sleepsort/708de5484d8484471ed33c4230790d39

@ot (Owner) commented Oct 2, 2016

> A simple counter-example is an array of the same number, [0.1] * 128; the algorithm will return index 1 instead of index 0, which is surely wrong.

That doesn't satisfy the invariant I mentioned in the comment. What is the invariant of your algorithm?

@sleepsort (Author) commented Oct 2, 2016

> That doesn't satisfy the invariant I mentioned in the comment

Ah, I see; sorry, I didn't understand the term "invariant" previously. I don't know whether real data will produce the same mean_err over some ranges; how about this?

Error values: [4, 3], target = 4

In theory, it is possible that we miss a better compression method by returning index 1 instead of index 0. And this example breaks the invariant as well.
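Spelled out as a quick check with the toy numbers above: with error values [4, 3] and target 4, error(low) is not strictly greater than the target, so the precondition error(high) <= target < error(low) fails, and returning high (index 1) misses the better-compressing index 0.

```python
errors = [4, 3]
target = 4
# Smallest index whose error already meets the target:
best = min(i for i, e in enumerate(errors) if e <= target)
assert best == 0   # index 0 already qualifies, so returning index 1 overshoots
```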

I might need some time to figure out the invariant of my algorithm... My intention is to make sure that error(low) > target_error never happens, unless every i in the array has error(i) > target_error, which already triggers the exit condition. Can that be considered an invariant?

@ot (Owner) commented Oct 2, 2016

That still doesn't satisfy the invariant (in this case, that error(low) > target). I'm not sure you're familiar with standard algorithm correctness proofs; see for example https://en.wikipedia.org/wiki/Loop_invariant.

In the algorithm as it is now, we have:

  • Precondition: error(high) <= target < error(low)
  • Invariant: same as precondition
  • Postcondition: same as precondition, plus high - low = 1

From the postcondition you can derive that low is the largest integer with error > target, and high is the smallest integer with error <= target. So if you want the best compression such that the error is <= target, just use high. Why, then, do you want to change the binary search implementation?

@sleepsort (Author):
Thanks for the link, I'll check it later.

Yes, your algorithm is right if the invariant holds. My point is that we cannot always assert that real data meets this invariant; consider edge cases such as error(i) == error(j), or error(low) == target. The code is vulnerable to those cases, which might happen in production.

And my initial intention was to make this binary search more general.

@ot (Owner) commented Oct 2, 2016

> For example, edge cases such as error(i) == error(j)

This is not an edge case, and the algorithm still works. As for the preconditions, you can check them before running the search and decide what to do if they are not met: in this case it means that the initial range is not wide enough, and we should just stop the algorithm rather than try to return an endpoint of the range.

I prefer using an algorithm that has obvious guarantees. What are the guarantees of your version? Can you prove them?
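Checking the preconditions up front, as suggested, might look like the sketch below; `error(q)` is again a hypothetical non-increasing stand-in for the quantize-and-measure step, not the repository's code.

```python
def search_with_precheck(error, target_err, low=1, high=128):
    if error(low) <= target_err:
        # Left endpoint already meets the target; nothing to search for.
        return low, error(low)
    if error(high) > target_err:
        # Range too narrow: even the finest setting misses the target.
        raise ValueError("precondition error(high) <= target_err violated")
    # Precondition now holds: error(high) <= target_err < error(low)
    while high - low > 1:
        mid = (high + low) // 2
        if error(mid) > target_err:
            low = mid
        else:
            high = mid
    return high, error(high)
```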

@sleepsort (Author) commented Oct 2, 2016

> For example, edge cases such as error(i) == error(j)
>
> This is not an edge case, and the algorithm still works.

Oh, to clarify, I meant that it breaks the precondition rather than the invariant. Since the precondition error(high) <= target < error(low) implies error(high) < error(low), the case error(high) == error(low) is the so-called edge case. Anyway, it can be pre-checked.


And I understand the difference in our preferences :). My preference is an implementation that handles all possible corner cases (my understanding of "general") with less code.

As I understand it, to make your algorithm error-immune the code would have to look like:

// quantize & dequantize on low to make sure `target < error(low)` [1]
// quantize & dequantize on high to make sure `error(high) <= target` [2]
while (...) {
  // quantize & dequantize on mid to find optimal solution [3]
}
// quantize & dequantize on high to get result [4]

And my algorithm can drop [1] and [2], and report the failure of [2] when [4] is calculated, no matter what the initial values of low and high are.


About the invariant in my algorithm, how about this?
Assumption: error(0) = target + 1, error(128) <= target
Precondition: error(low - 1) > target >= error(high)
Invariant: Same as precondition
Postcondition: same as precondition, plus low = high

The first assumption is a dummy, and exception handling for the second assumption can be done after the loop, so the code looks like:

while (...) {
  // quantize & dequantize on mid to find optimal solution [3]
}
// quantize & dequantize on high to get result [4]
if (error(high) > target) {
  // exception handling
}
return high

And to explain the loop in my code:

  • every time after f moves, assert(A[f-1] > target);
  • every time after t moves, assert(A[t] <= target) (inside the loop, assert(m < t), so t = m really moves t);
  • outside the while loop, assert(f == t) and assert(A[t-1] > target). The only case where A[t] <= target does not hold is when t hasn't moved.
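The assertions above can be cross-checked on a toy non-increasing error table, using f/t as in the bullets (a sketch of the described loop, not the PR's actual code):

```python
def half_open_search(A, target):
    """Smallest index t with A[t] <= target, for non-increasing A.
    Returns len(A) - 1 even if no element qualifies; the caller must
    check A[t] afterwards, as in the exception-handling step above."""
    f, t = 0, len(A) - 1
    while f < t:
        m = f + (t - f) // 2
        if A[m] > target:
            f = m + 1          # after f moves: A[f - 1] > target
        else:
            t = m              # after t moves: A[t] <= target
    return t                   # here f == t

A = [9, 7, 5, 4, 4, 2, 1]
t = half_open_search(A, target=4)
assert A[t] <= 4 and (t == 0 or A[t - 1] > 4)
```

Note the caller-side check matters in the no-solution case: for A = [9, 7] and target = 1, t never moves, so A[t] > target must be detected after the loop.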
